Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/expect: fix data race #16173

Merged
merged 1 commit into from
Jul 13, 2023
Merged

pkg/expect: fix data race #16173

merged 1 commit into from
Jul 13, 2023

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Jul 4, 2023

Let's say there is command which outputs one line and exit with error.

There are three goroutines to acquire the lock:

  1. ep.read()
  2. ep.waitSaveExitErr()
  3. ep.Expect()

When ep.read goroutine reads the log but it doesn't acquire the lock in time, the ep.waitSaveExitErr acquires the lock and updates the exitErr. And then ep.Expect acquires lock but it doesn't see any log yet and then returns err.

It's hard to reproduce it in local. Add the extra sleep can reproduce it.

diff --git a/pkg/expect/expect.go b/pkg/expect/expect.go
index a512a3ce4..602bea73f 100644
--- a/pkg/expect/expect.go
+++ b/pkg/expect/expect.go
@@ -128,6 +128,7 @@ func (ep *ExpectProcess) tryReadNextLine(r *bufio.Reader) error {
        printDebugLines := os.Getenv("EXPECT_DEBUG") != ""
        l, err := r.ReadString('\n')

+       time.Sleep(10 * time.Millisecond)
        ep.mu.Lock()
        defer ep.mu.Unlock()

See it once in Github Action [1]. In order to fix it, the patch introduces readCloseCh to wait for ep.read to get all the data and retry it.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@tjungblu
Copy link
Contributor

tjungblu commented Jul 4, 2023

wow, great catch. I wonder if we can simplify the whole expect a little by serialization?
Let's say we would always wait for the process to exit first, then drain its pty output and then match for the function on that output.

I'm not entirely sure whether we have many cases that spawn a longer running task and require an "early" expect result for test/runtime reasons. TestGrpcProxyAutoSync seems to be one such candidate.
edit: ok seems it's also used that way in EtcdServerProcess 🐰

@fuweid
Copy link
Member Author

fuweid commented Jul 5, 2023

Let's say we would always wait for the process to exit first, then drain its pty output and then match for the function on that output.

I was thinking that case. But there are some cases to expect process is ready, like EtcdServerProcess. So we still need to check the ongoing log data.

func WaitReadyExpectProc(ctx context.Context, exproc *expect.ExpectProcess, readyStrs []string) error {
matchSet := func(l string) bool {
for _, s := range readyStrs {
if strings.Contains(l, s) {
return true
}
}
return false
}
_, err := exproc.ExpectFunc(ctx, matchSet)
return err
}

@tjungblu
Copy link
Contributor

tjungblu commented Jul 5, 2023

indeed, that's also what I found browsing for it. Alternatively we can also check the readiness probe instead? Anyway, I think that's a bigger change overall.

@fuweid
Copy link
Member Author

fuweid commented Jul 5, 2023

Alternatively we can also check the readiness probe instead? Anyway, I think that's a bigger change overall.

Agree. It's reasonable to use expect pkg for short-live process only. It needs ETCD server to expose more metrics in API to replace checking the logging. Maybe we can track this in issue for enhancement.

Comment on lines 197 to 198
// NOTE: The ep.read() goroutine needs the mutex so we have to drain the
// log without lock.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we have to drain the log without lock.

Could you describe where we drain the log without lock? I thought ep.read() always drains the log with the mutex.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the unclear comment.

etcd/pkg/expect/expect.go

Lines 102 to 138 in 4364e75

func (ep *ExpectProcess) read() {
defer ep.wg.Done()
defer func(fpty *os.File) {
err := fpty.Close()
if err != nil {
// we deliberately only log the error here, closing the PTY should mostly be (expected) broken pipes
fmt.Printf("error while closing fpty: %v", err)
}
}(ep.fpty)
r := bufio.NewReader(ep.fpty)
for {
err := ep.tryReadNextLine(r)
if err != nil {
break
}
}
}
func (ep *ExpectProcess) tryReadNextLine(r *bufio.Reader) error {
printDebugLines := os.Getenv("EXPECT_DEBUG") != ""
l, err := r.ReadString('\n')
ep.mu.Lock()
defer ep.mu.Unlock()
if l != "" {
if printDebugLines {
fmt.Printf("%s (%s) (%d): %s", ep.cmd.Path, ep.cfg.name, ep.cmd.Process.Pid, l)
}
ep.lines = append(ep.lines, l)
ep.count++
}
// we're checking the error here at the bottom to ensure any leftover reads are still taken into account
return err
}

If ep.read() just reads last line from pty, it will try to get lock and then append the log into ep.lines. So, if the ep.Expect waits the ep.readCloseCh channel close with the lock, the ep.read() doesn't have a chance to get the lock and then the ep.readCloseCh won't be closed until ep.Expect releases the lock. It's deadlock.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

// NOTE: The ep.read() goroutine needs the mutex so we have to drain the
// log without lock.

I guess ep.ExpectFunc does not drain the log which causes my confusion.

How about updating to?

// NOTE: we wait readCloseCh for ep.read() to complete draining the log before acquring the lock. 

Copy link
Member

@chaochn47 chaochn47 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one nit, Thanks @fuweid!

Comment on lines 197 to 198
// NOTE: The ep.read() goroutine needs the mutex so we have to drain the
// log without lock.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

// NOTE: The ep.read() goroutine needs the mutex so we have to drain the
// log without lock.

I guess ep.ExpectFunc does not drain the log which causes my confusion.

How about updating to?

// NOTE: we wait readCloseCh for ep.read() to complete draining the log before acquring the lock. 

line := ep.lines[i]
i++
if f(line) {
return line, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any consumer of ExpectFunc relies on the exit error, so return nil should be fine.

@fuweid
Copy link
Member Author

fuweid commented Jul 8, 2023

Thanks @chaochn47 ! updated

@fuweid
Copy link
Member Author

fuweid commented Jul 12, 2023

ping @ahrtr @serathius

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a very minor comment.

Thanks @fuweid

@@ -69,6 +71,7 @@ func NewExpectWithEnv(name string, args []string, env []string, serverProcessCon
},
}
ep.cmd = commandFromConfig(ep.cfg)
ep.readCloseCh = make(chan struct{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment, we can just put this line into the struct initialization,

	ep = &ExpectProcess{
		cfg: expectConfig{
			name: serverProcessConfigName,
			cmd:  name,
			args: args,
			env:  env,
		},
		readCloseCh: make(chan struct{}),
	}

Let's say there is command which outputs one line and exit with error.

There are three goroutines to acquire the lock:

1. ep.read()
2. ep.waitSaveExitErr()
3. ep.Expect()

When ep.read goroutine reads the log but it doesn't acquire the lock in
time, the ep.waitSaveExitErr acquires the lock and updates the
`exitErr`. And then ep.Expect acquires lock but it doesn't see any log
yet and then returns err.

It's hard to reproduce it in local. Add the extra sleep can reproduce it.

```diff
diff --git a/pkg/expect/expect.go b/pkg/expect/expect.go
index a512a3ce4..602bea73f 100644
--- a/pkg/expect/expect.go
+++ b/pkg/expect/expect.go
@@ -128,6 +128,7 @@ func (ep *ExpectProcess) tryReadNextLine(r *bufio.Reader) error {
        printDebugLines := os.Getenv("EXPECT_DEBUG") != ""
        l, err := r.ReadString('\n')

+       time.Sleep(10 * time.Millisecond)
        ep.mu.Lock()
        defer ep.mu.Unlock()
```

See it once in Github Action [1]. In order to fix it, the patch introduces
`readCloseCh` to wait for ep.read to get all the data and retry it.

[1]: etcd-io#16137 (comment)

Signed-off-by: Wei Fu <[email protected]>
@ahrtr ahrtr merged commit 1cf49e5 into etcd-io:main Jul 13, 2023
@fuweid fuweid deleted the fix-datarace-in-expect branch July 13, 2023 09:24
@jmhbnz jmhbnz mentioned this pull request Sep 25, 2023
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants